Exploring the semantics of emojis: analysis and visualisation of categorical data
To run the code of this chapter, you will need to install and load the following packages:
library(here)
library(tidyverse)
#install.packages("patchwork")
library(patchwork)
#install.packages("ragg")
library(ragg)
Chapter overview
The chapter will walk you through:
- getting an overview of and exploring the data of a study
- preprocessing data for analysis (including translation, ordering and categorisation of levels)
- analysing and interpreting frequencies of categorical variables
- visualising frequencies using a barplot
- inserting and displaying emojis in R and inside of plots
- assembling multiple plots into a patchwork and interpreting it
We will work with the data from this study by Fricke et al. (2024):
Fricke, L., Grosz, P. G., & Scheffler, T. (2024). Semantic differences in visually similar face emojis. Language and Cognition, 1–15. https://doi.org/10.1017/langcog.2024.12
Fricke et al. (2024) have made their data and their analysis code publicly available. You can access it at https://osf.io/k2t9p/. The data is contained in the file raw_data.csv.
1 Introducing the study
Face emojis stand in for facial expressions and thereby fundamentally contribute to the subtext of a text message. A few studies have investigated the relationship between emojis and the emotions they depict. However, as emojis are a relatively recent phenomenon, there is still a lot to be discovered. In this chapter, we will look into the study by Fricke et al. (2024).
1.1 Deconstructing emojis into Action Units
Fricke et al. (2024) compared visually similar emojis using a face emoji annotation system developed by Fugate and Franco (2021). This annotation system is based on the Facial Action Coding System (FACS) for human faces invented by Ekman and Friesen (1978). Fricke et al. (2024) assigned numbers to human-like facial features such as eyebrows arched and eyes wide. These numbers are called Action Units (AUs for short). As you can see in Figure 1, each emoji consists of several AUs:
Fricke et al. (2024) defined two different types of emoji pairs: In the AU+ condition, the visual difference between emojis corresponded to a difference between AUs. In the AU- condition, the visual difference did not correspond to an AU difference.
1.2 The experiment
Three AU+ and three AU- emoji pairs were created (see Figure 1). Each pair was assigned two contexts, with each context corresponding to the prominent usage or meaning of one emoji but not the other. For example, the contexts of the first pair are happiness and (cheeky) laughter. The contexts were assigned based on https://emojipedia.org and a previous norming study.
Four single-sentence narratives were created for each of the contexts. For an example, see Figure 2, translated from German below (taken from Fricke et al., 2024, p. 6):
Alex writes to his best friend Stefan:
I just learned that my cousin’s dog has his own advent calendar.
Alex is amused. Which of the emojis matches the message better? 😄😁
Alex writes to his best friend Stefan:
I just learned that I won 500 Euro in the lottery.
Alex is overjoyed. Which of the emojis matches the message better? 😄😁
These stories were divided up into four experimental lists of 12 items. Each list also contained 12 filler items, so that each participant saw 24 items. The participants were then asked to help choose the emoji that matched the context. Each participant saw each emoji pair twice. The rate with which the context-matching emoji was chosen was measured.
Fricke et al. (2024)’s central research question was: Do AU differences lead to differences in meaning between the two emojis of a pair? In line with the pictorial approach by Maier (2023), they predicted that small visual differences between emojis which correspond to human facial features (AU+) would be more semantically relevant compared to those that do not (AU-).
Q1. According to Fricke et al. (2024)’s hypothesis, which of these results would you expect from the experiment?
2 Exploring the relationship between gender and emoji understanding
Fricke et al. (2024) asked participants about their gender, their attitude towards emojis, how often they use emojis on WhatsApp and how well they think they understand emojis. They visualised the distribution of male and female gender for emoji use and emoji attitude as barplots:
The plots in Figure 3 show that women use emojis more often and have a more positive attitude towards emojis than men. We want to find out whether women also reported a higher level of emoji understanding than men. Our analysis will involve three steps:
- calculating the frequencies of the genders in the data
- calculating the frequencies of the different levels of emoji understanding for each gender
- visualising the frequencies in a barplot similar to the plots above
2.1 Importing the data
Before we start our analysis, we import the data using read.csv() in combination with the here() function (see https://elenlefoll.github.io/RstatsTextbook/6_ImpoRtingData.html 6.5 Importing data from a .csv file):
raw_data <- read.csv(file = here("data", "raw_data.csv"))
As specified by Fricke et al. (2024), we filter out participants who exceed the maximum age of 35 years for all following analyses. We do this by using the function filter() and store the result in a new data frame called data:
data <- raw_data |>
  filter(age <= 35)
2.2 Gender frequency analysis
Let’s first get a general overview: How many men, women, and non-binary people participated in the study?
The relevant variable in the data set is called gender. However, you will see that the names of the different gender groups are in German. Before we start analysing, we should translate them into English. To figure out what the labels of the different gender levels are, we use the levels() function. levels() applies to factors, so we first need to convert gender with the function as.factor():
data$gender <- as.factor(data$gender)
Now, we can look at the levels of our factor gender:
levels(data$gender)
[1] "divers"   "männlich" "weiblich"
Using a combination of mutate() and recode(), we translate männlich to men, weiblich to women, and divers to non-binary:
data <- data |>
  mutate(gender = recode(gender,
                         "männlich" = "men",
                         "weiblich" = "women",
                         "divers" = "non-binary"))
levels(data$gender)
[1] "non-binary" "men"        "women"
Now that the genders have English names, we want to determine how many men, women, and non-binary subjects participated. This is not straightforward, because the data frame contains 24 rows for each subject, as each participant saw 24 items (see Section 1.2). If we were to simply count the occurrences of men, women, and non-binary in the data, we would end up with frequencies that are 24 times too high.
To determine the actual gender distribution, we need to group the data according to the subjects’ unique IDs. To do this, we apply the group_by function to the submission_id variable. We then pipe count(gender) to this to count the genders by submission_id:
gender_count <- data |>
  group_by(submission_id) |>
  count(gender)
The result is stored in a new data frame called gender_count. Now we can look at the distribution using the table() function:
table(gender_count$gender)
non-binary        men      women
         3        109         47
Alternatively, we can use the distinct() function to keep only one row per submission_id (more precisely, the first row in which each submission_id occurs). The argument .keep_all is set to TRUE, which means that all other variables in the data frame are kept and not deleted:
gender_count <- data |>
  distinct(submission_id, .keep_all = TRUE)
table(gender_count$gender)
non-binary        men      women
         3        109         47
The mode (see https://elenlefoll.github.io/RstatsTextbook/8_DescriptiveStats.html 8.1.3 Mode) of the gender variable in the dataset is men, as you can see from the output. The gender distribution is very uneven: 109 men, 47 women, and 3 non-binary people participated in the study. This is likely to skew our visualisation.
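To make the difference between counting rows and counting participants concrete, here is a minimal sketch on invented toy data. The data frame toy and all of its values are assumptions made up for illustration; they are not taken from the study.

```r
library(dplyr)

# Toy data (hypothetical): 3 participants, each with 4 rows (one per item)
toy <- data.frame(
  submission_id = rep(c(101, 102, 103), each = 4),
  gender        = rep(c("men", "men", "women"), each = 4)
)

# Naive count: every participant is counted once per item
naive_counts <- table(toy$gender)

# Keep one row per participant, then count
participants <- toy |>
  distinct(submission_id, .keep_all = TRUE)
participant_counts <- table(participants$gender)

# The mode is the most frequent category
gender_mode <- names(which.max(participant_counts))

naive_counts
participant_counts
gender_mode
```

In this sketch, the naive counts are four times too high because each toy participant contributes four rows, whereas the de-duplicated counts reflect the number of people.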
Q2. Which of these problems are likely to occur if we plot emoji understanding by gender in a barplot with unequal group sizes?
To solve these problems, we will use the same strategies as Fricke et al. (2024). We will use relative rather than absolute frequencies to make sure that the numbers for the different genders are comparable. For our visualisation, we will have to exclude the very small group of three non-binary participants.
2.3 How well do the different genders understand emojis?
Next, we calculate the relative frequencies of the different levels of emoji understanding for each gender.
The variable we are interested in is called emoji_understanding. Just like with gender, we first have to do some data wrangling. We convert emoji_understanding into a factor to get its levels:
data$emoji_understanding <- as.factor(data$emoji_understanding)
levels(data$emoji_understanding)
[1] "eher gut"    "gut"         "mittelmäßig" "sehr gut"
We translate mittelmäßig to moderate, eher gut to rather good, gut to good, and sehr gut to very good:
data <- data |>
  mutate(emoji_understanding = recode(emoji_understanding,
                                      "mittelmäßig" = "moderate",
                                      "eher gut" = "rather good",
                                      "gut" = "good",
                                      "sehr gut" = "very good"))
levels(data$emoji_understanding)
[1] "rather good" "good"        "moderate"    "very good"
The levels are still in the wrong order. We need to rearrange them in ascending order from moderate to very good. To do this, we define a vector c("moderate", "rather good", "good", "very good") and pass it to the levels argument of the factor() function, which re-encodes the variable with the levels in this order:
data <- data |>
  mutate(emoji_understanding = factor(emoji_understanding,
                                      levels = c("moderate",
                                                 "rather good",
                                                 "good",
                                                 "very good")))
levels(data$emoji_understanding)
[1] "moderate"    "rather good" "good"        "very good"
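The effect of the levels argument can also be seen on a small scale. The following base R sketch uses an invented vector ratings (an assumption for illustration only) to contrast the default alphabetical level order with an explicitly specified one.

```r
# Hypothetical responses, stored as plain character strings
ratings <- c("good", "very good", "moderate", "good", "rather good")

# Default: as.factor() sorts the levels alphabetically
default_levels <- levels(as.factor(ratings))

# Explicit: the levels follow the order passed to the `levels` argument
ordered_ratings <- factor(ratings,
                          levels = c("moderate", "rather good",
                                     "good", "very good"))
explicit_levels <- levels(ordered_ratings)

default_levels
explicit_levels

# Tables (and later plot axes) follow the level order
table(ordered_ratings)
```

Only the order of the levels changes; the values stored in the vector stay the same.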
The levels now look correct, so we can determine the frequencies for the different gender groups within emoji_understanding. We could do this by simply cross-tabulating gender with emoji understanding (see https://elenlefoll.github.io/RstatsTextbook/8_DescriptiveStats.html 8.1.3 Mode). But since we know that the sizes of the gender subsets are very unequal, we also want to calculate the relative frequencies to make the numbers comparable. There is an easy way to calculate relative frequencies using the proportions() function (see https://elenlefoll.github.io/RstatsTextbook/8_DescriptiveStats.html 8.2.1 Distributions of categorical variables). However, we need to take two additional considerations into account:
- Our aim is to calculate proportions within groups and not across the whole data.
- We want to create a comprehensive visualisation that contains both groups of men and women in a single barplot.
To achieve both, we have to first group our data, using the powerful combination of group_by() and count(). We create a new data frame gender_understanding_count and again keep only each participant’s unique submission_id as above. We group the data by gender and count the frequencies for the different genders within the emoji_understanding factor:
gender_understanding_count <- data |>
  distinct(submission_id, .keep_all = TRUE) |>
  group_by(gender) |>
  count(gender, emoji_understanding)
If we check our data frame gender_understanding_count with View(), it looks like this:
n is calculated by the count() function and represents the number of occurrences for each combination of gender and emoji_understanding. Next, we add a column with the relative frequencies, which we calculate with the formula proportions(n) * 100:
gender_understanding_count <- data |>
  distinct(submission_id, .keep_all = TRUE) |>
  group_by(gender) |>
  count(gender, emoji_understanding) |>
  mutate(percentage = proportions(n) * 100)
Finally, we print our frequency table:
print(gender_understanding_count)
# A tibble: 10 × 4
# Groups:   gender [3]
   gender     emoji_understanding     n percentage
   <fct>      <fct>               <int>      <dbl>
 1 non-binary rather good             1     33.3
 2 non-binary good                    2     66.7
 3 men        moderate                1      0.917
 4 men        rather good            25     22.9
 5 men        good                   41     37.6
 6 men        very good              42     38.5
 7 women      moderate                1      2.13
 8 women      rather good            12     25.5
 9 women      good                   13     27.7
10 women      very good              21     44.7
From the frequency table, we can already see that non-binary participants reported either a rather good or good understanding of emojis. A higher percentage of women (44.7%) reported a very good emoji understanding compared to men (38.5%). But let’s create our barplot to see the distribution more clearly.
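Why group_by() matters for the percentages can be checked on a small scale. The following sketch uses invented toy data (the data frame toy and the column understanding are hypothetical) to contrast proportions computed within each gender with proportions computed across the whole data.

```r
library(dplyr)

# Toy data (hypothetical): one row per participant
toy <- data.frame(
  gender        = c("men", "men", "men", "men", "women", "women"),
  understanding = c("good", "very good", "very good", "very good",
                    "good", "very good")
)

# Grouped: the result of count() is still grouped by gender,
# so proportions() is computed separately for men and women
within_group <- toy |>
  group_by(gender) |>
  count(understanding) |>
  mutate(percentage = proportions(n) * 100)

# Ungrouped: proportions() is computed over all rows at once
overall <- toy |>
  count(gender, understanding) |>
  mutate(percentage = proportions(n) * 100)

within_group
overall
```

In the grouped version, the percentages add up to 100 within each gender, which is what makes groups of different sizes comparable. In the ungrouped version, they add up to 100 across the whole table.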
2.4 Data visualisation
As mentioned above, we filter out the very small group of non-binary participants for our visualisation:
gender_understanding_count <- gender_understanding_count |>
  filter(gender != "non-binary")
We use ggplot() to create a barplot with emoji_understanding on the x-axis and the relative frequencies on the y-axis. The bars are coloured according to gender:
ggplot(gender_understanding_count, aes(x = emoji_understanding,
                                       y = percentage,
                                       fill = gender)) +
  geom_bar(stat = "identity", position = "dodge")
To make our plot more meaningful, we add a title and labels. We also change the colours manually to make it look nicer. The hexadecimal colour values chosen here are from the colour-blind friendly palette “Set2” from the package {RColorBrewer}. Since we only need two colours, we can easily insert them manually without having to install an additional package.
ggplot(gender_understanding_count, aes(x = emoji_understanding,
                                       y = percentage,
                                       fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Self-reported Emoji Understanding by Gender",
       x = "Emoji understanding",
       y = "Percent") +
  # first two colours of the "Set2" palette
  scale_fill_manual(values = c("#66C2A5", "#FC8D62"))
As you can see from the barplot, the gender distribution for emoji understanding is not as clear as for emoji use and emoji attitude.
Q3. How do you interpret this plot?
Interestingly, compared to Figure 3, women were more confident in reporting that they used emojis frequently and had a positive attitude towards emojis than they were in reporting that they understood emojis well. It is possible that some women were more modest in rating their understanding of emojis, which could indicate a gender confidence gap. Reporting a good understanding likely requires more confidence compared to frequent use or positive attitudes.
3 Comparing matching rates between AU conditions
We will now turn to exploring the central research question of Fricke et al. (2024). Remember, it was: Do AU differences lead to differences in meaning between the two emojis of a pair?
Step by step, we will build an informative plot which will include all the information needed to answer this question. This plot will display how many times each emoji was chosen in its presumed corresponding context.
Ideally, we would plot matching rates based on a variable that tells us exactly when the participants responded with the matching emoji. Unfortunately, Fricke et al. (2024)’s raw data do not include such a variable. We will tackle this by ourselves in this section.
3.1 Preprocessing the data
First, we need a variable that includes the experimental conditions of each trial. The experimental conditions tell us whether the visual difference between the presented emojis corresponded to an AU difference (AU+) or not (AU-), or whether a filler was presented.
We add a variable AU_difference that will include the necessary information. We do this by using a combination of mutate, case_when and str_detect:
data <- data |>
  mutate(AU_difference = case_when(str_detect(name, "AU") ~ "AU+",
                                   .default = NULL))
The function mutate() adds the column representing our variable. str_detect() looks for a specific string in the column we specify. In our case, the column is called name, and the first string we need is "AU".
The code above means: look for the string "AU" in the column name, and in all cases where you find it (case_when()), write the string "AU+" into a new column called AU_difference. We do this for the other conditions as well:
data <- data |>
  mutate(AU_difference = case_when(str_detect(name, "AU") ~ "AU+",
                                   str_detect(name, "N") ~ "AU-",
                                   str_detect(name, "filler") ~ "filler",
                                   .default = NULL))
We use table() to check that everything worked:
table(data$AU_difference)
   AU-    AU+ filler
   954    954   1908
This looks promising. Since we are only interested in the experimental items, we now filter out filler trials:
data <- data |>
  filter(AU_difference != "filler")
We will now create another variable called context. The column of this variable will contain the contexts used in Figure 1. Again, we combine mutate, case_when and str_detect: In the column question, we look for context-characteristic strings, and add the context descriptions in case of a match. We check the output with table().
data <- data |>
  mutate(context = case_when(str_detect(question, "freut sich") ~ "happiness",
                             str_detect(question, "lacht") ~ "(cheeky) laughter",
                             str_detect(question, "macht sich Sorgen") ~ "concern",
                             str_detect(question, "ist überrascht") ~ "surprise",
                             str_detect(question, "ist etwas genervt") ~ "mild irritation",
                             str_detect(question, "ärgert sich") ~ "annoyance",
                             str_detect(question, "amüsiert sich") ~ "amusement",
                             str_detect(question, "ist überglücklich") ~ "(intense) happiness",
                             str_detect(question, "ist enttäuscht") ~ "mild disappointment",
                             str_detect(question, "ist enttäuscht") ~ "moderate disappointment",
                             str_detect(question, "ist gut gelaunt") ~ "happiness2",
                             str_detect(question, "ist verlegen") ~ "bashfulness",
                             .default = NULL))
table(data$context)
  (cheeky) laughter (intense) happiness           amusement           annoyance
                159                 159                 159                 159
        bashfulness             concern           happiness          happiness2
                159                 159                 159                 159
mild disappointment     mild irritation            surprise
                318                 159                 159
Q4. Which problems become apparent when checking the output via the table?
The contexts mild disappointment and moderate disappointment have created some issues: Both are described by ist enttäuscht ‘is disappointed’. Except for their coding in the column name, their data appears to be identical. At this point, we have no choice but to look for additional disambiguating information in Fricke et al. (2024)’s analysis script: The emoji 🙁 (mild disappointment) is coded as N-36-L1 and ☹️ (moderate disappointment) as N-37-L1. Using this information, we redefine the two contexts:
data <- data |>
  mutate(context = case_when(
    str_detect(name, "N-36") ~ "mild disappointment",
    str_detect(name, "N-37") ~ "moderate disappointment",
    .default = context))
table(data$context)
      (cheeky) laughter     (intense) happiness               amusement
                    159                     159                     159
              annoyance             bashfulness                 concern
                    159                     159                     159
              happiness              happiness2     mild disappointment
                    159                     159                     159
        mild irritation moderate disappointment                surprise
                    159                     159                     159
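The behaviour behind the disappointment problem can be reproduced in isolation. The sketch below uses invented toy rows: the item codes follow the pattern described above, but the code N-99-L1 and the question wording are assumptions made up for illustration. It shows that case_when() uses the first condition that matches, and that rows matching no condition become NA.

```r
library(dplyr)
library(stringr)

# Hypothetical rows: invented question wording and one invented item code
toy <- data.frame(
  name     = c("N-36-L1", "N-37-L1", "N-99-L1"),
  question = c("Alex ist enttäuscht.", "Alex ist enttäuscht.", "Alex ist müde.")
)

toy <- toy |>
  mutate(
    # Identical conditions: the first one always wins,
    # so the second one is never reached
    by_question = case_when(
      str_detect(question, "ist enttäuscht") ~ "mild disappointment",
      str_detect(question, "ist enttäuscht") ~ "moderate disappointment"
    ),
    # Distinct conditions on the item code disambiguate the two contexts
    by_name = case_when(
      str_detect(name, "N-36") ~ "mild disappointment",
      str_detect(name, "N-37") ~ "moderate disappointment"
    )
  )

# table() drops NA by default; useNA = "ifany" makes unmatched rows visible
table(toy$by_name, useNA = "ifany")
```

When checking a newly created variable, adding useNA = "ifany" to table() can therefore help to spot rows that did not match any condition.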
Finally, we add the critical variable that describes whether there is a match between the chosen emojis and the contexts: if the emoji and the context agree, the variable will have the value match. Otherwise, the value will be no match.
data <- data |>
  mutate(
    match = case_when(
      context == "happiness" & response == "grinning_face_with_big_eyes" ~ "match",
      context == "(cheeky) laughter" & response == "grinning_squinting_face" ~ "match",
      context == "concern" & response == "hushed_face" ~ "match",
      context == "surprise" & response == "astonished_face" ~ "match",
      context == "mild irritation" & response == "neutral_face" ~ "match",
      context == "annoyance" & response == "expressionless_face" ~ "match",
      context == "amusement" & response == "grinning_face_with_smiling_eyes" ~ "match",
      context == "(intense) happiness" & response == "beaming_face_with_smiling_eyes" ~ "match",
      context == "mild disappointment" & response == "slightly_frowning_face" ~ "match",
      context == "moderate disappointment" & response == "frowning_face" ~ "match",
      context == "happiness2" & response == "smiling_face_with_smiling_eyes" ~ "match",
      context == "bashfulness" & response == "smiling_face" ~ "match",
      .default = "no match"))
3.2 Building the plot(s)
We will now build our plots to visualise the matching rates per emoji pair. In a new data frame called data_AU, we group the data by context. The command count(match) counts matches and non-matches for each context and stores them in the column n. We add the column percent, which stores the rounded percentage of matches and non-matches for each context:
data_AU <- data |>
  group_by(context) |>
  count(match) |>
  mutate(percent = round(proportions(n)*100, 2))
Using the View() function, we take a look at our data:
We plot the first emoji pair of the AU+ condition 😯 😲 with their respective contexts concern and surprise:
plot_concern_surprise <- data_AU |>
  filter(context == "concern" | context == "surprise") |> #1.
  ggplot(aes(x = context, y = percent, fill = match)) + #2.
  geom_col() + #3.
  scale_x_discrete(limits = c("concern", "surprise")) + #4.
  scale_fill_manual(values = c("#66C2A5", "#FC8D62")) + #5.
  geom_text(aes(label = percent), position = position_stack(vjust = 0.5)) + #6.
  labs(x = "context", y = "percent", title = "😯 😲") + #7.
  theme_bw() + #8.
  theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank()) #9.
The above command creates a barplot and stores it in plot_concern_surprise. These are the steps:
1. Filter the contexts, such that only rows of the contexts concern or (|) surprise are taken into account.
2. Create a ggplot with context values on the x-axis and percentages scaled on the y-axis. Fill the area inside geoms with colour according to match values.
3. Display the plot as a barplot. By default, geom_bar counts how many times match and no match occur. However, as we have already calculated and stored the values in the column percent, we use geom_col to use the data as is.
4. The context values, which are displayed on the x-axis, are discrete. With this command, we set and order the contexts.
5. Adjust colours with values from “Set2” from the {RColorBrewer} package (see Section 2.4).
6. Annotate the percentages of matching rates by adding them as text and placing them inside the plot, in the middle of the corresponding bar.
7. Add labels: context on the x-axis, percent on the y-axis, and the corresponding emojis as the title.
8. Add a theme, in this case theme_bw().
9. Plot titles are left-aligned by default. Since we want the emojis to be displayed on top of their corresponding context bars, we move the title to the center of the plot. We also remove the title of the legend because the match and no match values are self-explanatory.
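The difference between geom_bar and geom_col described in step 3 can be inspected directly. The following sketch uses two invented toy data frames (trials and percentages are hypothetical names and values) and the {ggplot2} helper layer_data(), which returns the values that a layer draws.

```r
library(ggplot2)

# Toy data (hypothetical): raw trials vs. pre-computed percentages
trials      <- data.frame(match = c("match", "match", "match", "no match"))
percentages <- data.frame(match = c("match", "no match"),
                          percent = c(75, 25))

# geom_bar() counts the rows per category itself
p_bar <- ggplot(trials, aes(x = match)) +
  geom_bar()

# geom_col() takes the bar heights from the y aesthetic as is
p_col <- ggplot(percentages, aes(x = match, y = percent)) +
  geom_col()

# Bar heights as ggplot2 draws them
bar_heights <- layer_data(p_bar)$y
col_heights <- layer_data(p_col)$y

bar_heights
col_heights
```

In this sketch, the bars of p_bar have the heights of the row counts, while the bars of p_col have exactly the heights stored in the percent column.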
Let’s take a look at our plot:
plot_concern_surprise
We need not only one plot, but six: one for each emoji pair. We could write it all out for each emoji pair, but since the code is identical (except for the contexts and the emojis), it is much more efficient to define a function.
So far, we have only used built-in R functions, but have not defined our own functions. Functions are reusable code snippets that perform specific tasks. You have already learned about built-in R functions in https://elenlefoll.github.io/RstatsTextbook/7_VariablesFunctions.html 7.4 Using built-in R functions. However, for highly specific queries that are applied several times it makes sense to define a function. As a rule of thumb, whenever code seems redundant (you may find yourself copying and pasting a lot), it is best to define a function for that task.
The basic structure of a function call is function(argument). Looks familiar? Accordingly, we define a function in the following way: function(parameters){function body}
These are the steps:
- We define a function using the keyword function. After this keyword, we write a list of parameters in parentheses. Parameters act as placeholders for the function’s arguments.
- We then construct the function body and enclose it in curly brackets. The function body tells the function what it is meant to do when called upon.
- We also need to think of a name for our function under which we will be able to call it. The function is assigned to the name by <-.
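The three steps above can be sketched with a very small function. The name to_percent and its parameters counts and digits are hypothetical and chosen only for this illustration; the body reuses the proportions() and round() calls that appeared earlier in this chapter.

```r
# `to_percent` is a hypothetical helper defined for illustration only
to_percent <- function(counts, digits = 2) {
  # Function body: convert counts to rounded percentages
  round(proportions(counts) * 100, digits)
}

# Calling the function: the arguments fill in the parameters
to_percent(c(25, 75))
to_percent(c(1, 2), digits = 1)
```

The parameter digits has a default value, so it only needs to be supplied when a different number of decimal places is wanted.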
In our case, the process of defining a function is straightforward:
- We start with the keyword function and state that our function should take contexts as its first argument and emojis as its second argument, as only these change with each plot.
- We then simply paste the code we just wrote for our plot inside the curly braces, replacing the specific contexts and emojis with parameters.
- Our function is called plot_AU_matches because that is what it does: plotting AU matches.
plot_AU_matches <- function(contexts, emojis) {
  data_AU |>
    filter(context %in% contexts) |>
    ggplot(aes(x = context, y = percent, fill = match)) +
    geom_col() +
    scale_x_discrete(limits = contexts) +
    scale_fill_manual(values = c("#66C2A5", "#FC8D62")) +
    geom_text(aes(label = percent), position = position_stack(vjust = 0.5)) +
    labs(x = "context", y = "percent", title = emojis) +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank())
}
We apply this function to all contexts and emoji pairs by filling them in as the arguments:
plot_concern_surprise <- plot_AU_matches(c("concern", "surprise"), "😯 😲")
plot_happiness_cheeky <- plot_AU_matches(c("happiness", "(cheeky) laughter"), "😃 😆")
plot_mild_irr_annoyance <- plot_AU_matches(c("mild irritation", "annoyance"), "😐 😑")
plot_mild_disapp_mod_dissap <- plot_AU_matches(c("mild disappointment", "moderate disappointment"), "🙁️ ☹️")
plot_amusement_int_happiness <- plot_AU_matches(c("amusement", "(intense) happiness"), "😄 😁")
plot_happiness2_bashfulness <- plot_AU_matches(c("happiness2", "bashfulness"), "😊 ☺️")
There are various ways to insert emojis in R. The easiest is to use the emoji keyboard (see Figure 6 (a)). To open it on macOS, use the keyboard shortcut Ctrl + Cmd + Space or fn + e; on Windows, use Windows logo key + . (period). The emoji keyboard is also available in RStudio, if you go to the “Edit” drop-down menu and click on “Emojis & Symbols”. Alternatively, there are emoji libraries for R, for example {emo(ji)} developed by Wickham et al. (2024).
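A further option that needs neither an emoji keyboard nor an additional package is to write emojis as Unicode escapes in base R. The sketch below assumes an R session that uses UTF-8; the variable names are hypothetical. U+1F604 and U+1F601 are the code points of the two emojis 😄 and 😁.

```r
# Two equivalent ways to write U+1F604 (grinning face with smiling eyes)
emoji_escape    <- "\U0001F604"
emoji_from_code <- intToUtf8(0x1F604)

# Both produce the same string
identical(emoji_escape, emoji_from_code)

# Building a title string for an emoji pair
# (U+1F601 is the beaming face with smiling eyes)
title_pair <- paste(emoji_escape, "\U0001F601")
title_pair
```

Whether the resulting string is displayed as an emoji still depends on the font and the graphics device.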
As we want to display emojis within plots, we need to pay even more attention to graphics. Emojis that are part of plots created by ggplot often cannot be displayed without additional setup. Additional problems can occur when rendering a Quarto or R Markdown document to HTML.
If displaying emojis as part of plots in RStudio does not work for you, you will need to use the high-quality graphics library “AGG” (“Anti-Grain Geometry”) or “Cairo” as a backend in RStudio. To do this, head to the “Tools” drop-down menu and click on “Global Options”. Then, go to the “Graphics” tab and select the “AGG” or “Cairo” option (see Figure 6 (b)).
To correctly render your Quarto document to HTML, you can use the {ragg} library developed by Pedersen and Shemanarev (2024). This library provides graphic devices based on AGG and includes advanced text rendering, with support for emojis. {ragg} can be used with knitr by using the following setup at the beginning of your document:
knitr::opts_chunk$set(dev = "ragg_png")
3.3 Assembling with {patchwork}
By applying our self-defined function plot_AU_matches to all emoji pairs and contexts, we have created one barplot for each emoji pair. We will use the {patchwork} package (Pedersen, 2024) to assemble the plots, thereby creating an overview. As the name suggests, {patchwork} enables us to patch several plots together and arrange them nicely, so that the finished plot will be cohesive and informative.
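As a first impression of how {patchwork} combines plots, here is a minimal sketch. It does not use the emoji plots from above; instead it builds two placeholder plots from R's built-in mtcars data, so the plot names and titles are assumptions made for illustration.

```r
library(ggplot2)
library(patchwork)

# Two placeholder plots built from the built-in `mtcars` data
p1 <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(title = "Plot 1")
p2 <- ggplot(mtcars, aes(x = factor(gear))) +
  geom_bar() +
  labs(title = "Plot 2")

# `+` places plots next to each other, plot_layout() controls the grid,
# and plot_annotation() adds an overall title
combined <- p1 + p2 +
  plot_layout(ncol = 2) +
  plot_annotation(title = "Two plots in one patchwork")

combined
```

The same operators work for any number of ggplot objects, which is what allows several barplots to be arranged in a single figure.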